Exploratory analysis of textual data streams
Abstract
In this paper, we address the exploratory analysis of textual data streams. We propose a bootstrapping process, based on a combination of keyword similarity and clustering techniques, that: i) classifies documents into fine-grained similarity clusters based on keyword commonalities; ii) aggregates similar clusters into larger document collections sharing a richer, more user-prominent keyword set, which we call a topic; iii) assimilates the topics newly extracted in the current bootstrapping cycle with the topics resulting from previous cycles, linking similar topics from different time periods, if any, to highlight topic trends and evolution. We also define an analysis framework that enables topic-based exploration of the underlying textual data stream from both a thematic and a temporal perspective. The bootstrapping process is evaluated on a real data stream of about 330,000 newspaper articles about politics published by the New York Times from Jan 1st 1900 to Dec 31st 2015.
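The three steps of one bootstrapping cycle can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the clustering algorithm, or any thresholds, so everything here is an assumption made for illustration: Jaccard similarity over keyword sets, a greedy threshold-based grouping, and the hypothetical names `jaccard`, `greedy_groups`, `bootstrap_cycle`, `fine`, `coarse`, and `link`. It is not the authors' implementation.

```python
from typing import List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_groups(items: List[Set[str]], threshold: float) -> List[List[int]]:
    """Greedy grouping (assumed): each item joins the first group whose pooled
    keyword set is at least `threshold`-similar, otherwise it starts a new group."""
    groups: List[List[int]] = []
    pooled: List[Set[str]] = []
    for i, keywords in enumerate(items):
        for g, pool in enumerate(pooled):
            if jaccard(keywords, pool) >= threshold:
                groups[g].append(i)
                pool.update(keywords)  # grow the group's pooled keyword set
                break
        else:
            groups.append([i])
            pooled.append(set(keywords))
    return groups


def bootstrap_cycle(
    docs: List[Set[str]],
    previous_topics: List[Set[str]],
    fine: float = 0.5,
    coarse: float = 0.15,
    link: float = 0.3,
) -> Tuple[List[Set[str]], List[Tuple[int, int]]]:
    """One hypothetical cycle: cluster documents, aggregate clusters into
    topics, then link new topics to topics from earlier cycles."""
    # i) fine-grained clusters of documents with common keywords
    clusters = greedy_groups(docs, fine)
    cluster_keywords = [set().union(*(docs[i] for i in c)) for c in clusters]
    # ii) aggregate similar clusters into topics with a richer keyword set
    topic_groups = greedy_groups(cluster_keywords, coarse)
    topics = [set().union(*(cluster_keywords[j] for j in g)) for g in topic_groups]
    # iii) link each new topic to similar topics from previous cycles, if any
    links = [
        (t, p)
        for t, new in enumerate(topics)
        for p, old in enumerate(previous_topics)
        if jaccard(new, old) >= link
    ]
    return topics, links


if __name__ == "__main__":
    documents = [
        {"election", "vote", "senate"},
        {"election", "vote", "congress"},
        {"tax", "budget", "senate"},
        {"war", "treaty"},
    ]
    earlier = [{"war", "treaty", "army"}]
    new_topics, topic_links = bootstrap_cycle(documents, earlier)
    print(new_topics)
    print(topic_links)
```

Each pair in `links` is (index of a new topic, index of a previous topic); chaining such pairs over successive cycles is one way a topic's evolution over time could be traced. The greedy grouping is order-dependent and is used only to keep the sketch short.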
Similar resources
Identifying and Ranking the Important Textual and Paratextual Elements in Fiction Retrieval
Purpose: The purpose of this study is to identify the textual and paratextual elements in retrieving fiction from the readers’ perspective in order to provide the most appropriate access points for the readers and to improve access to fiction based on the readers’ needs. Method: The current research is an applied study in terms of purpose, applying a mixed method that was conducted using the ...
Influence of Stream channel morphology and in-stream habitats on fish community in Golestan province Streams
Four streams with different sizes were selected for studying the effects of environmental factors on fish assemblages using indirect (Detrended Correspondence Analysis, DCA) and direct (Redundancy Analysis, RDA) gradient analysis in Golestan province. DCA of presence-absence and relative abundance data showed well gradient and linear model of species variability. In the within-site RDA, environ...
TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features
The Internet offers a vast amount of multi-modal and heterogeneous information, mainly in the form of textual and visual data. Most current web-based visual object classification methods utilize only one of these data streams. As we will show in this paper, combining these modalities in a proper way often provides better results not attainable by relying on only one of these data streams. How...
Exploratory Correlation Analysis
We present a novel unsupervised artificial neural network for the extraction of common features in multiple data sources. This algorithm, which we name Exploratory Correlation Analysis (ECA), is a multi-stream extension of a neural implementation of Exploratory Projection Pursuit (EPP) and has a close relationship with Canonical Correlation Analysis (CCA). Whereas EPP identifies "interesting" s...
A System for Keyword Search on Textual Streams
An increasing amount of data is produced in the form of text streams; these can be RSS news feeds, TV closed captions, emails, etc. We study the problem of answering keyword queries on multiple textual streams. We define the result of a keyword query inspired by previous work on keyword search on static databases. A result to a query is a combination of streams “sufficiently correlated” to eac...
Journal title:
- Future Generation Comp. Syst.
Volume 68
Publication date: 2017